Background¶

About the dataset: The DARWIN dataset includes handwriting data from 174 participants. The classification task is to distinguish Alzheimer’s disease patients from healthy people.

Creator: Francesco Fontanella

Source: https://archive.ics.uci.edu/dataset/732/darwin

The DARWIN dataset was created to allow researchers to improve the existing machine-learning methodologies for the prediction of Alzheimer's disease via handwriting analysis.

Citation Requests/Acknowledgements

N. D. Cilia, C. De Stefano, F. Fontanella, A. S. Di Freca, An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science 141 (2018) 466–471. https://doi.org/10.1016/j.procs.2018.10.141

N. D. Cilia, G. De Gregorio, C. De Stefano, F. Fontanella, A. Marcelli, A. Parziale, Diagnosing Alzheimer’s disease from online handwriting: A novel dataset and performance benchmarking, Engineering Applications of Artificial Intelligence, Vol. 111 (2022) 104822. https://doi.org/10.1016/j.engappai.2022.104822

Protocol: The researchers developed a protocol consisting of 25 handwriting tasks designed to assess different aspects of cognitive and motor function potentially affected by AD. These tasks fall into three categories: Graphic, Copy, and Memory.

Data Acquisition: They collected data from 174 participants (89 AD patients and 85 healthy controls) using a Wacom Bamboo tablet, recording pen tip movements and pressure.

Feature Extraction: From the raw data, they extracted 18 features per task, encompassing measures of time, speed, acceleration, jerk, pressure, and spatial characteristics.

The target variable to predict is whether a participant has Alzheimer's disease (P) or is healthy (H).¶

P: Stands for "Patients", referring to individuals diagnosed with Alzheimer's Disease.
H: Stands for "Healthy", referring to individuals who are not diagnosed with Alzheimer's Disease and serve as a control group.

25 tasks¶

Each row gives the task number, its description, and its category (G = Graphic, C = Copy, M = Memory).

1 Signature drawing M
2 Join two points with a horizontal line, continuously four times G
3 Join two points with a vertical line, continuously four times G
4 Retrace a circle (6 cm in diameter) continuously four times G
5 Retrace a circle (3 cm in diameter) continuously four times G
6 Copy the letters ‘l’, ‘m’ and ‘p’ C
7 Copy the letters on the adjacent rows C
8 Write in cursive a sequence of four lowercase letters ‘l’ in a single smooth movement C
9 Write in cursive a sequence of four lowercase bigrams ‘le’ in a single smooth movement C
10 Copy the word “foglio” C
11 Copy the word “foglio” above a line C
12 Copy the word “mamma” C
13 Copy the word “mamma” above a line C
14 Memorize the words “telefono”, “cane”, and “negozio” and rewrite them M
15 Copy in reverse the word “bottiglia” C
16 Copy in reverse the word “casa” C
17 Copy six words (regular, non-regular, non-words) in the appropriate boxes C
18 Write the name of the object shown in a picture (a chair) M
19 Copy the fields of a postal order C
20 Write a simple sentence under dictation M
21 Retrace a complex form G
22 Copy a telephone number C
23 Write a telephone number under dictation M
24 Draw a clock with all the hours and set the hands to 11:05 (Clock Drawing Test) G
25 Copy a paragraph C

For each task, from the raw data, i.e. (x,y)-coordinates, pressure and timestamp, we extracted 18 features, detailed in the following.¶

  1. Time Features:

    Total Time (TT): Overall task duration.

    Air Time (AT): Time spent with the pen in the air.

    Paper Time (PT): Time spent writing on the paper.

  2. Speed Features:

    Mean Speed on-paper (MSP): Average speed of writing on paper.

    Mean Speed in-air (MSA): Average speed of pen movement in the air.

  3. Movement Smoothness Features:

    Mean Acceleration on-paper (MAP): Average acceleration of writing on paper.

    Mean Acceleration in-air (MAA): Average acceleration of pen movement in the air.

    Mean Jerk on-paper (MJP): Average jerk (change in acceleration) of writing on paper.

    Mean Jerk in-air (MJA): Average jerk of pen movement in the air.

  4. Pressure Features:

    Pressure Mean (PM): Average pressure exerted by the pen on the paper.

    Pressure Var (PV): Variance (fluctuation) of the pressure exerted by the pen.

  5. Global Mean Relative Tremor (GMRT) Features:

    GMRT on-paper (GMRTP): Measure of tremor during writing on paper.

    GMRT in-air (GMRTA): Measure of tremor during in-air movements.

    Mean GMRT (GMRT): Average of GMRTP and GMRTA.

  6. Other Features:

    Pendowns Number (PWN): Number of times the pen touches the paper.

    Max X Extension (XE): Maximum horizontal distance covered by writing.

    Max Y Extension (YE): Maximum vertical distance covered by writing.

    Dispersion Index (DI): Measure of how much of the paper is used for writing.
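As a rough illustration of how a few of these features could be derived from raw tablet samples, here is a minimal sketch. It is not the DARWIN authors' extraction code: the function name `basic_task_features`, the rule that the pen is on paper whenever pressure is positive, and the choice to average per-segment speeds are all assumptions made for the example. Timestamps are assumed to be strictly increasing.

```python
import numpy as np


def basic_task_features(x, y, t, pressure):
    """Compute a handful of time and speed features from raw pen samples.

    Assumption: a sample is "on paper" when its pressure is > 0, and each
    segment between two consecutive samples inherits the state of its
    first sample.
    """
    x, y, t, pressure = (np.asarray(a, dtype=float) for a in (x, y, t, pressure))

    on_paper = pressure > 0                   # assumed pen-down criterion
    dt = np.diff(t)                           # duration of each segment
    dist = np.hypot(np.diff(x), np.diff(y))   # length of each segment
    speed = dist / dt                         # per-segment speed
    seg_on = on_paper[:-1]                    # state of each segment

    def mean_speed(mask):
        # Return NaN when there is no segment of the requested kind.
        return float(speed[mask].mean()) if mask.any() else float("nan")

    # A pen-down is counted at the start (if already on paper) and at
    # every air-to-paper transition.
    pendowns = int(on_paper[0]) + int(np.sum(~on_paper[:-1] & on_paper[1:]))

    return {
        "total_time": float(t[-1] - t[0]),
        "paper_time": float(dt[seg_on].sum()),
        "air_time": float(dt[~seg_on].sum()),
        "mean_speed_on_paper": mean_speed(seg_on),
        "mean_speed_in_air": mean_speed(~seg_on),
        "num_of_pendown": pendowns,
    }


# Toy trace: two strokes separated by a pause with the pen lifted.
features = basic_task_features(
    x=[0, 3, 6, 6, 6],
    y=[0, 4, 8, 8, 8],
    t=[0, 1, 2, 3, 4],
    pressure=[1, 1, 0, 0, 1],
)
print(features)
```

Acceleration and jerk could be obtained the same way by differencing the speed once and twice more, again split by the on-paper mask.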

Import necessary libraries¶

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import chi2

Read data¶

In [5]:
file_path = 'data.csv'
df = pd.read_csv(file_path)
# Drop the participant identifier, which is not a feature
df = df.drop(columns=['ID'])
In [6]:
df.head()
Out[6]:
air_time1 disp_index1 gmrt_in_air1 gmrt_on_paper1 max_x_extension1 max_y_extension1 mean_acc_in_air1 mean_acc_on_paper1 mean_gmrt1 mean_jerk_in_air1 ... mean_jerk_in_air25 mean_jerk_on_paper25 mean_speed_in_air25 mean_speed_on_paper25 num_of_pendown25 paper_time25 pressure_mean25 pressure_var25 total_time25 class
0 5160 0.000013 120.804174 86.853334 957 6601 0.361800 0.217459 103.828754 0.051836 ... 0.141434 0.024471 5.596487 3.184589 71 40120 1749.278166 296102.7676 144605 P
1 51980 0.000016 115.318238 83.448681 1694 6998 0.272513 0.144880 99.383459 0.039827 ... 0.049663 0.018368 1.665973 0.950249 129 126700 1504.768272 278744.2850 298640 P
2 2600 0.000010 229.933997 172.761858 2333 5802 0.387020 0.181342 201.347928 0.064220 ... 0.178194 0.017174 4.000781 2.392521 74 45480 1431.443492 144411.7055 79025 P
3 2130 0.000010 369.403342 183.193104 1756 8159 0.556879 0.164502 276.298223 0.090408 ... 0.113905 0.019860 4.206746 1.613522 123 67945 1465.843329 230184.7154 181220 P
4 2310 0.000007 257.997131 111.275889 987 4732 0.266077 0.145104 184.636510 0.037528 ... 0.121782 0.020872 3.319036 1.680629 92 37285 1841.702561 158290.0255 72575 P

5 rows × 451 columns

In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Columns: 451 entries, air_time1 to class
dtypes: float64(300), int64(150), object(1)
memory usage: 613.2+ KB
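The 451 columns reported above are consistent with 25 tasks × 18 features plus the `class` label. The sketch below rebuilds the expected column names from the 18 feature stems that appear in this dataset's column names. The helper `expected_columns` is hypothetical, and the assumption that every task uses the same stems in the same order is inferred from the outputs shown here rather than documented.

```python
# The 18 feature stems, in the order they appear for each task.
FEATURE_STEMS = [
    "air_time", "disp_index", "gmrt_in_air", "gmrt_on_paper",
    "max_x_extension", "max_y_extension", "mean_acc_in_air",
    "mean_acc_on_paper", "mean_gmrt", "mean_jerk_in_air",
    "mean_jerk_on_paper", "mean_speed_in_air", "mean_speed_on_paper",
    "num_of_pendown", "paper_time", "pressure_mean", "pressure_var",
    "total_time",
]


def expected_columns(n_tasks=25, target="class"):
    """Return the column names assumed for the task-by-task layout."""
    names = [
        f"{stem}{task}"
        for task in range(1, n_tasks + 1)
        for stem in FEATURE_STEMS
    ]
    return names + [target]


columns = expected_columns()
print(len(columns))      # 25 * 18 + 1
print(columns[:3], columns[-2:])
```

With the notebook's `df` loaded, comparing `list(df.columns)` against `expected_columns()` would confirm or refute the assumed layout before any positional slicing is done.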
In [8]:
df.describe()
Out[8]:
air_time1 disp_index1 gmrt_in_air1 gmrt_on_paper1 max_x_extension1 max_y_extension1 mean_acc_in_air1 mean_acc_on_paper1 mean_gmrt1 mean_jerk_in_air1 ... mean_gmrt25 mean_jerk_in_air25 mean_jerk_on_paper25 mean_speed_in_air25 mean_speed_on_paper25 num_of_pendown25 paper_time25 pressure_mean25 pressure_var25 total_time25
count 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 ... 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 174.000000 1.740000e+02
mean 5664.166667 0.000010 297.666685 200.504413 1977.965517 7323.896552 0.416374 0.179823 249.085549 0.067556 ... 221.360646 0.148286 0.019934 4.472643 2.871613 85.839080 43109.712644 1629.585962 163061.767360 1.642033e+05
std 12653.772746 0.000003 183.943181 111.629546 1648.306365 2188.290512 0.381837 0.064693 132.698462 0.074776 ... 63.762013 0.062207 0.002388 1.501411 0.852809 27.485518 19092.024337 324.142316 56845.610814 4.969397e+05
min 65.000000 0.000002 28.734515 29.935835 754.000000 561.000000 0.067748 0.096631 41.199445 0.011861 ... 69.928033 0.030169 0.014987 1.323565 0.950249 32.000000 15930.000000 474.049462 26984.926660 2.998000e+04
25% 1697.500000 0.000008 174.153023 136.524742 1362.500000 6124.000000 0.218209 0.146647 161.136182 0.029523 ... 178.798382 0.107732 0.018301 3.485934 2.401199 66.000000 32803.750000 1499.112088 120099.046800 5.917500e+04
50% 2890.000000 0.000009 255.791452 176.494494 1681.000000 6975.500000 0.275184 0.163659 224.445268 0.039233 ... 217.431621 0.140483 0.019488 4.510578 2.830672 81.000000 37312.500000 1729.385010 158236.771800 7.611500e+04
75% 4931.250000 0.000011 358.917885 234.052560 2082.750000 8298.500000 0.442706 0.188879 294.392298 0.071057 ... 264.310776 0.199168 0.021134 5.212794 3.335828 101.500000 46533.750000 1865.626974 200921.078475 1.275425e+05
max 109965.000000 0.000028 1168.328276 865.210522 18602.000000 15783.000000 2.772566 0.627350 836.784702 0.543199 ... 437.373267 0.375078 0.029227 10.416715 5.602909 209.000000 139575.000000 1999.775983 352981.850000 5.704200e+06

8 rows × 450 columns

In [9]:
int_columns = df.select_dtypes(include='int').columns
float_columns = df.select_dtypes(include='float').columns
object_columns = df.select_dtypes(include='object').columns
In [10]:
print("Integer columns:", int_columns)
print("Float columns:", float_columns)
print("Object columns:", object_columns)
Integer columns: Index(['air_time1', 'max_x_extension1', 'max_y_extension1', 'num_of_pendown1',
       'paper_time1', 'total_time1', 'air_time2', 'max_x_extension2',
       'max_y_extension2', 'num_of_pendown2',
       ...
       'max_y_extension24', 'num_of_pendown24', 'paper_time24', 'total_time24',
       'air_time25', 'max_x_extension25', 'max_y_extension25',
       'num_of_pendown25', 'paper_time25', 'total_time25'],
      dtype='object', length=150)
Float columns: Index(['disp_index1', 'gmrt_in_air1', 'gmrt_on_paper1', 'mean_acc_in_air1',
       'mean_acc_on_paper1', 'mean_gmrt1', 'mean_jerk_in_air1',
       'mean_jerk_on_paper1', 'mean_speed_in_air1', 'mean_speed_on_paper1',
       ...
       'gmrt_on_paper25', 'mean_acc_in_air25', 'mean_acc_on_paper25',
       'mean_gmrt25', 'mean_jerk_in_air25', 'mean_jerk_on_paper25',
       'mean_speed_in_air25', 'mean_speed_on_paper25', 'pressure_mean25',
       'pressure_var25'],
      dtype='object', length=300)
Object columns: Index(['class'], dtype='object')

Features¶

In [11]:
# Assumed definition of X (it is not defined earlier): all columns except the target
X = df.drop(columns=['class'])
categorical_columns = list(X.select_dtypes(include=['object', 'category']).columns)
numerical_columns = [col for col in X.columns if col not in categorical_columns]
print("Total columns: {}. Categorical columns {}. Numerical columns {}".format(len(X.columns), len(categorical_columns), len(numerical_columns)))

1. EDA¶

Target variable¶

In [12]:
target_counts = df['class'].value_counts()

print(target_counts)

# Plot the bar chart
plt.figure(figsize=(10, 6))
target_counts.plot(kind='bar')
plt.xlabel("Alzheimer's patient class")
plt.ylabel('Count')
plt.title("Distribution of Alzheimer's patient class")
plt.xticks(rotation=45)
plt.show()
class
P    89
H    85
Name: count, dtype: int64
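The counts above show that the two classes are nearly balanced (89 of 174 is about 51% patients). For later modeling it can be convenient to have the label as 0/1. The following is a minimal sketch under assumptions: `encode_target` is a hypothetical helper, the mapping P → 1 and H → 0 is a choice made here, and the series is rebuilt from the printed counts rather than read from `data.csv`.

```python
import pandas as pd


def encode_target(labels, mapping=None):
    """Map class labels to integers, failing loudly on unknown labels."""
    mapping = {"H": 0, "P": 1} if mapping is None else mapping
    encoded = labels.map(mapping)
    if encoded.isna().any():
        unknown = sorted(set(labels[encoded.isna()]))
        raise ValueError(f"Unmapped labels: {unknown}")
    return encoded.astype(int)


# Stand-in for df['class'], rebuilt from the counts printed above.
labels = pd.Series(["P"] * 89 + ["H"] * 85, name="class")

proportions = labels.value_counts(normalize=True)
y = encode_target(labels)

print(proportions.round(3))
print(y.value_counts())
```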

Numerical variables¶

Each participant performed 25 tasks, and the same 18 features were extracted for every task. For illustration, we perform the EDA on task 1 only.

In [13]:
# Subset 18 features of task 1

num_subset = df.iloc[:,0:18]

# Subset 18 features of task 1 and task 2

num_subset2 = df.iloc[:,0:36]

# Subset 18 features of task 1 , task 2 and task 3

num_subset3 = df.iloc[:,0:54]
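The positional slices above rely on the columns being stored task by task, 18 at a time. A name-based selection avoids that dependency. This is a sketch only: `task_columns` is a hypothetical helper, and it assumes every feature column is named as a stem followed by the task number (as in `air_time1`), with stems that do not end in a digit.

```python
import pandas as pd


def task_columns(frame, task):
    """Select the columns whose name ends with exactly this task number.

    The negative lookbehind stops task 1 from also matching 11 or 21.
    """
    pattern = rf"(?<!\d){task}$"
    return frame.filter(regex=pattern)


# Toy frame standing in for the notebook's df.
toy = pd.DataFrame(
    [[1, 2, 3, 4, 5, "P"]],
    columns=["air_time1", "total_time1", "air_time11",
             "total_time11", "air_time21", "class"],
)

print(list(task_columns(toy, 1).columns))
print(list(task_columns(toy, 11).columns))
```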
In [14]:
num_subset.head()
Out[14]:
air_time1 disp_index1 gmrt_in_air1 gmrt_on_paper1 max_x_extension1 max_y_extension1 mean_acc_in_air1 mean_acc_on_paper1 mean_gmrt1 mean_jerk_in_air1 mean_jerk_on_paper1 mean_speed_in_air1 mean_speed_on_paper1 num_of_pendown1 paper_time1 pressure_mean1 pressure_var1 total_time1
0 5160 0.000013 120.804174 86.853334 957 6601 0.361800 0.217459 103.828754 0.051836 0.021547 1.828076 1.493242 22 10730 1679.232060 288285.0449 15890
1 51980 0.000016 115.318238 83.448681 1694 6998 0.272513 0.144880 99.383459 0.039827 0.016885 1.817744 1.517763 11 12460 1723.171348 210516.6356 64440
2 2600 0.000010 229.933997 172.761858 2333 5802 0.387020 0.181342 201.347928 0.064220 0.020126 3.378343 3.308866 10 6080 1520.253289 120845.8717 8680
3 2130 0.000010 369.403342 183.193104 1756 8159 0.556879 0.164502 276.298223 0.090408 0.021150 5.082499 3.542645 10 5595 1913.995532 100286.6032 7725
4 2310 0.000007 257.997131 111.275889 987 4732 0.266077 0.145104 184.636510 0.037528 0.018590 3.804656 2.180544 8 4080 1819.121324 160061.8198 6390
In [15]:
num_cols = num_subset.columns

Histograms¶

In [16]:
# Plot histograms
plt.figure(figsize=(20, 20))
for i, col in enumerate(num_cols, 1):
    plt.subplot(6, 4, i)  # Adjust the number of rows and columns as needed
    num_subset[col].dropna().hist(bins=30)  # Drop NaN values for histogram
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()

plt.show()

Pair plots¶

In [17]:
sns.pairplot(df[num_cols.tolist() + ['class']], hue='class')

plt.show()
//anaconda3/lib/python3.11/site-packages/seaborn/_base.py:948: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

Box plots¶

In [18]:
plt.figure(figsize=(20, 20))
for i, col in enumerate(num_cols, 1):
    plt.subplot(6, 4, i)  # Adjust the number of rows and columns as needed
    sns.boxplot(x=num_subset[col].dropna())
    plt.title(col)
    plt.xlabel(col)

plt.tight_layout()
plt.show()
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)

There are a lot of outliers.
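
To put a number on this, here is a minimal sketch that counts outliers per feature with the 1.5 × IQR rule conventionally used for box-plot whiskers. The helper name `count_iqr_outliers` and the toy frame are illustrative assumptions, not part of the original notebook; in the notebook the function would be applied to a frame such as `num_subset`.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparison aligns the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Toy data standing in for the feature table (values are made up).
toy = pd.DataFrame({
    "air_time1": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
    "paper_time1": [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Sorting the counts in descending order makes it easy to see which features contribute most of the outliers.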

Correlations Greater Than 0.7¶

In [19]:
correlation_matrix = num_subset.corr()

# Keep only correlations whose absolute value exceeds 0.7
high_correlation = correlation_matrix[(correlation_matrix > 0.7) | (correlation_matrix < -0.7)]
# Set diagonal and lower triangle to NaN to avoid duplication
for i in range(len(high_correlation)):
    for j in range(i+1):
        high_correlation.iat[i, j] = None

# Drop rows and columns with all NaN values
high_correlation = high_correlation.dropna(how='all', axis=0).dropna(how='all', axis=1)

high_correlation
Out[19]:
mean_acc_on_paper1 mean_gmrt1 mean_jerk_in_air1 mean_jerk_on_paper1 mean_speed_in_air1 mean_speed_on_paper1 paper_time1 total_time1
air_time1 NaN NaN NaN NaN NaN NaN NaN 0.972902
disp_index1 NaN NaN NaN NaN NaN NaN 0.808375 NaN
gmrt_in_air1 NaN 0.940325 NaN NaN 0.826749 NaN NaN NaN
gmrt_on_paper1 0.875478 0.828012 NaN 0.738690 NaN 0.985557 NaN NaN
mean_acc_in_air1 NaN NaN 0.988005 NaN NaN NaN NaN NaN
mean_acc_on_paper1 NaN NaN NaN 0.886071 NaN 0.896524 NaN NaN
mean_gmrt1 NaN NaN NaN NaN 0.863950 0.822041 NaN NaN
mean_jerk_on_paper1 NaN NaN NaN NaN NaN 0.749382 NaN NaN
num_of_pendown1 NaN NaN NaN NaN NaN NaN 0.726058 NaN
paper_time1 NaN NaN NaN NaN NaN NaN NaN 0.757173
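
The nested loop in the cell above blanks the diagonal and lower triangle one cell at a time. A vectorized sketch of the same filter is shown here, assuming NumPy is available; the function name `upper_triangle_high_corr` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def upper_triangle_high_corr(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return correlations with |r| > threshold, keeping each pair once."""
    corr = frame.corr()
    # Boolean mask that is True strictly above the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Keep a cell only if it is above the diagonal and strong enough.
    kept = corr.where(upper & (corr.abs() > threshold).to_numpy())
    # Drop rows and columns that ended up entirely NaN.
    return kept.dropna(how="all", axis=0).dropna(how="all", axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                    # independent noise
})
high = upper_triangle_high_corr(toy)
print(high)
```

Building the mask once avoids writing into the DataFrame cell by cell and leaves the original correlation matrix untouched.
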
In [20]:
correlation_matrix = num_subset2.corr()

# Keep only correlations whose absolute value exceeds 0.7
high_correlation = correlation_matrix[(correlation_matrix > 0.7) | (correlation_matrix < -0.7)]
# Set diagonal and lower triangle to NaN to avoid duplication
for i in range(len(high_correlation)):
    for j in range(i+1):
        high_correlation.iat[i, j] = None

# Drop rows and columns with all NaN values
high_correlation = high_correlation.dropna(how='all', axis=0).dropna(how='all', axis=1)

high_correlation
Out[20]:
mean_acc_on_paper1 mean_gmrt1 mean_jerk_in_air1 mean_jerk_on_paper1 mean_speed_in_air1 mean_speed_on_paper1 paper_time1 total_time1 mean_gmrt2 mean_jerk_in_air2 mean_jerk_on_paper2 mean_speed_on_paper2 num_of_pendown2 paper_time2 total_time2
air_time1 NaN NaN NaN NaN NaN NaN NaN 0.972902 NaN NaN NaN NaN NaN NaN NaN
disp_index1 NaN NaN NaN NaN NaN NaN 0.808375 NaN NaN NaN NaN NaN NaN NaN NaN
gmrt_in_air1 NaN 0.940325 NaN NaN 0.826749 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
gmrt_on_paper1 0.875478 0.828012 NaN 0.738690 NaN 0.985557 NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean_acc_in_air1 NaN NaN 0.988005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean_acc_on_paper1 NaN NaN NaN 0.886071 NaN 0.896524 NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean_gmrt1 NaN NaN NaN NaN 0.863950 0.822041 NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean_jerk_on_paper1 NaN NaN NaN NaN NaN 0.749382 NaN NaN NaN NaN NaN NaN NaN NaN NaN
num_of_pendown1 NaN NaN NaN NaN NaN NaN 0.726058 NaN NaN NaN NaN NaN NaN NaN NaN
paper_time1 NaN NaN NaN NaN NaN NaN NaN 0.757173 NaN NaN NaN NaN NaN NaN NaN
air_time2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.878915 0.730840 0.964098
disp_index2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.767908 NaN
gmrt_in_air2 NaN NaN NaN NaN NaN NaN NaN NaN 0.932219 NaN NaN NaN NaN NaN NaN
gmrt_on_paper2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.896537 NaN NaN NaN
mean_acc_in_air2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.994576 NaN NaN NaN NaN NaN
mean_acc_on_paper2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.818415 NaN NaN NaN NaN
num_of_pendown2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.869366
paper_time2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.885851

Some of the 18 features in Task 1 are highly correlated with each other, but none of them is highly correlated with any feature from Tasks 2 through 25.

In other words, high correlations (|r| > 0.7) occur only between features of the same task.
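
One way to check this for every pair of tasks, rather than only Tasks 1 and 2, is to inspect just the cross-task block of the correlation matrix. The sketch below assumes a dict that maps task names to per-task DataFrames; `max_cross_task_corr` and the synthetic data are illustrative assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def max_cross_task_corr(tasks: dict) -> pd.Series:
    """Largest |r| between any feature of one task and any feature of another."""
    results = {}
    for (name_a, a), (name_b, b) in combinations(tasks.items(), 2):
        # Correlate the combined frame, then keep only the a-vs-b block.
        block = pd.concat([a, b], axis=1).corr().loc[a.columns, b.columns]
        results[(name_a, name_b)] = np.nanmax(block.abs().to_numpy())
    return pd.Series(results).sort_values(ascending=False)

# Synthetic stand-in: two tasks, each with an internally correlated feature pair.
rng = np.random.default_rng(1)
t1, t2 = rng.normal(size=300), rng.normal(size=300)
toy_tasks = {
    "Task1": pd.DataFrame({"air_time1": t1,
                           "total_time1": t1 + rng.normal(scale=0.1, size=300)}),
    "Task2": pd.DataFrame({"air_time2": t2,
                           "total_time2": t2 + rng.normal(scale=0.1, size=300)}),
}
cross = max_cross_task_corr(toy_tasks)
print(cross)
```

Any entry above 0.7 in the returned Series would contradict the observation that strong correlations stay within a task.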

In [31]:
# Plot the high correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(high_correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1, linewidths=0.5, linecolor='gray')
plt.title('High Correlation Heatmap (|correlation| > 0.7)')
plt.show()

Calculate the Correlation Matrix and Identify Highly Correlated Pairs within each task¶

In [21]:
# Each task contributes 18 consecutive feature columns (25 tasks, 450 columns)
tasks = {
    f"Task{i + 1}": df.iloc[:, 18 * i : 18 * (i + 1)]
    for i in range(25)
}

# Dictionary to store the high correlation pairs for each task
high_corr_pairs_dict = {}

for task_name, task_data in tasks.items():
    # Calculate the correlation matrix
    correlation_matrix = task_data.corr()
    
    # Keep only correlations whose absolute value exceeds 0.7
    high_correlation = correlation_matrix[(correlation_matrix > 0.7) | (correlation_matrix < -0.7)]
    
    # Set diagonal and lower triangle to NaN to avoid duplication
    for i in range(len(high_correlation)):
        for j in range(i + 1):
            high_correlation.iat[i, j] = None
    
    # Drop rows and columns with all NaN values
    high_correlation = high_correlation.dropna(how='all', axis=0).dropna(how='all', axis=1)
    
    # List the pairs that are highly correlated
    high_corr_pairs = high_correlation.stack().reset_index()
    high_corr_pairs.columns = ['Feature1', 'Feature2', 'Correlation']
    
    # Store the result in the dictionary
    high_corr_pairs_dict[task_name] = high_corr_pairs

# Print the high correlation pairs for each task
for task_name, high_corr_pairs in high_corr_pairs_dict.items():
    if not high_corr_pairs.empty:
        print(f"Highly Correlated Feature Pairs in {task_name}:")
        print(high_corr_pairs[['Feature1', 'Feature2', 'Correlation']])
        print("\n")
Highly Correlated Feature Pairs in Task1:
               Feature1              Feature2  Correlation
0             air_time1           total_time1     0.972902
1           disp_index1           paper_time1     0.808375
2          gmrt_in_air1            mean_gmrt1     0.940325
3          gmrt_in_air1    mean_speed_in_air1     0.826749
4        gmrt_on_paper1    mean_acc_on_paper1     0.875478
5        gmrt_on_paper1            mean_gmrt1     0.828012
6        gmrt_on_paper1   mean_jerk_on_paper1     0.738690
7        gmrt_on_paper1  mean_speed_on_paper1     0.985557
8      mean_acc_in_air1     mean_jerk_in_air1     0.988005
9    mean_acc_on_paper1   mean_jerk_on_paper1     0.886071
10   mean_acc_on_paper1  mean_speed_on_paper1     0.896524
11           mean_gmrt1    mean_speed_in_air1     0.863950
12           mean_gmrt1  mean_speed_on_paper1     0.822041
13  mean_jerk_on_paper1  mean_speed_on_paper1     0.749382
14      num_of_pendown1           paper_time1     0.726058
15          paper_time1           total_time1     0.757173


Highly Correlated Feature Pairs in Task2:
             Feature1              Feature2  Correlation
0           air_time2       num_of_pendown2     0.878915
1           air_time2           paper_time2     0.730840
2           air_time2           total_time2     0.964098
3         disp_index2           paper_time2     0.767908
4        gmrt_in_air2            mean_gmrt2     0.932219
5      gmrt_on_paper2  mean_speed_on_paper2     0.896537
6    mean_acc_in_air2     mean_jerk_in_air2     0.994576
7  mean_acc_on_paper2   mean_jerk_on_paper2     0.818415
8     num_of_pendown2           total_time2     0.869366
9         paper_time2           total_time2     0.885851


Highly Correlated Feature Pairs in Task3:
             Feature1              Feature2  Correlation
0           air_time3           total_time3     0.847516
1         disp_index3           paper_time3     0.842708
2        gmrt_in_air3            mean_gmrt3     0.933240
3      gmrt_on_paper3  mean_speed_on_paper3     0.899824
4    mean_acc_in_air3     mean_jerk_in_air3     0.996660
5  mean_acc_on_paper3   mean_jerk_on_paper3     0.802542
6     num_of_pendown3         pressure_var3     0.719333
7         paper_time3           total_time3     0.814712


Highly Correlated Feature Pairs in Task4:
             Feature1              Feature2  Correlation
0           air_time4           total_time4     0.936848
1        gmrt_in_air4            mean_gmrt4     0.944949
2        gmrt_in_air4    mean_speed_in_air4     0.776363
3      gmrt_on_paper4  mean_speed_on_paper4     0.988750
4    max_x_extension4      max_y_extension4     0.770221
5    mean_acc_in_air4     mean_jerk_in_air4     0.996134
6    mean_acc_in_air4    mean_speed_in_air4     0.762881
7  mean_acc_on_paper4   mean_jerk_on_paper4     0.869096
8          mean_gmrt4    mean_speed_in_air4     0.788678
9   mean_jerk_in_air4    mean_speed_in_air4     0.746192


Highly Correlated Feature Pairs in Task5:
             Feature1              Feature2  Correlation
0           air_time5           total_time5     0.841695
1         disp_index5      max_y_extension5     0.850136
2        gmrt_in_air5            mean_gmrt5     0.881580
3      gmrt_on_paper5  mean_speed_on_paper5     0.990230
4    max_x_extension5      max_y_extension5     0.707088
5    mean_acc_in_air5     mean_jerk_in_air5     0.969452
6  mean_acc_on_paper5   mean_jerk_on_paper5     0.841853
7         paper_time5           total_time5     0.905422


Highly Correlated Feature Pairs in Task6:
              Feature1              Feature2  Correlation
0            air_time6           total_time6     0.984177
1          disp_index6           paper_time6     0.804206
2          disp_index6           total_time6     0.718099
3         gmrt_in_air6      mean_acc_in_air6     0.728092
4         gmrt_in_air6            mean_gmrt6     0.859972
5         gmrt_in_air6     mean_jerk_in_air6     0.728523
6         gmrt_in_air6    mean_speed_in_air6     0.948211
7       gmrt_on_paper6    mean_acc_on_paper6     0.705548
8       gmrt_on_paper6            mean_gmrt6     0.777139
9       gmrt_on_paper6  mean_speed_on_paper6     0.972472
10    mean_acc_in_air6     mean_jerk_in_air6     0.999460
11    mean_acc_in_air6    mean_speed_in_air6     0.803366
12  mean_acc_on_paper6   mean_jerk_on_paper6     0.841727
13  mean_acc_on_paper6  mean_speed_on_paper6     0.737166
14          mean_gmrt6    mean_speed_in_air6     0.809440
15          mean_gmrt6  mean_speed_on_paper6     0.774425
16   mean_jerk_in_air6    mean_speed_in_air6     0.800747
17     num_of_pendown6           paper_time6     0.716184
18         paper_time6           total_time6     0.789970


Highly Correlated Feature Pairs in Task7:
               Feature1              Feature2  Correlation
0             air_time7           total_time7     0.995395
1           disp_index7        gmrt_on_paper7    -0.710178
2           disp_index7           paper_time7     0.854556
3          gmrt_in_air7      mean_acc_in_air7     0.790206
4          gmrt_in_air7            mean_gmrt7     0.861947
5          gmrt_in_air7     mean_jerk_in_air7     0.788250
6          gmrt_in_air7    mean_speed_in_air7     0.989095
7        gmrt_on_paper7    mean_acc_on_paper7     0.704073
8        gmrt_on_paper7            mean_gmrt7     0.925679
9        gmrt_on_paper7   mean_jerk_on_paper7     0.757407
10       gmrt_on_paper7  mean_speed_on_paper7     0.960689
11     mean_acc_in_air7     mean_jerk_in_air7     0.999657
12     mean_acc_in_air7    mean_speed_in_air7     0.812783
13   mean_acc_on_paper7   mean_jerk_on_paper7     0.894712
14   mean_acc_on_paper7  mean_speed_on_paper7     0.716698
15           mean_gmrt7    mean_speed_in_air7     0.841244
16           mean_gmrt7  mean_speed_on_paper7     0.885753
17    mean_jerk_in_air7    mean_speed_in_air7     0.809633
18  mean_jerk_on_paper7  mean_speed_on_paper7     0.763888


Highly Correlated Feature Pairs in Task8:
             Feature1              Feature2  Correlation
0           air_time8       num_of_pendown8     0.788848
1           air_time8           total_time8     0.930366
2         disp_index8           paper_time8     0.837786
3         disp_index8           total_time8     0.703056
4        gmrt_in_air8            mean_gmrt8     0.968389
5      gmrt_on_paper8  mean_speed_on_paper8     0.996254
6    mean_acc_in_air8     mean_jerk_in_air8     0.996209
7  mean_acc_on_paper8   mean_jerk_on_paper8     0.817212
8     num_of_pendown8           total_time8     0.802352
9         paper_time8           total_time8     0.821477


Highly Correlated Feature Pairs in Task9:
             Feature1              Feature2  Correlation
0           air_time9       num_of_pendown9     0.808861
1           air_time9           total_time9     0.945315
2        gmrt_in_air9            mean_gmrt9     0.976753
3      gmrt_on_paper9  mean_speed_on_paper9     0.989501
4    mean_acc_in_air9     mean_jerk_in_air9     0.998487
5  mean_acc_on_paper9   mean_jerk_on_paper9     0.792467
6     num_of_pendown9         pressure_var9     0.723058
7     num_of_pendown9           total_time9     0.806161
8         paper_time9           total_time9     0.786542


Highly Correlated Feature Pairs in Task10:
               Feature1               Feature2  Correlation
0            air_time10       num_of_pendown10     0.718506
1            air_time10           paper_time10     0.754052
2            air_time10           total_time10     0.977930
3         gmrt_in_air10            mean_gmrt10     0.910764
4         gmrt_in_air10    mean_speed_in_air10     0.804005
5       gmrt_on_paper10    mean_acc_on_paper10     0.734830
6       gmrt_on_paper10            mean_gmrt10     0.815390
7       gmrt_on_paper10  mean_speed_on_paper10     0.986698
8     mean_acc_in_air10     mean_jerk_in_air10     0.993249
9   mean_acc_on_paper10   mean_jerk_on_paper10     0.838245
10  mean_acc_on_paper10  mean_speed_on_paper10     0.745339
11          mean_gmrt10    mean_speed_in_air10     0.736995
12          mean_gmrt10  mean_speed_on_paper10     0.825203
13     num_of_pendown10           paper_time10     0.711796
14     num_of_pendown10           total_time10     0.756727
15         paper_time10           total_time10     0.874640


Highly Correlated Feature Pairs in Task11:
               Feature1               Feature2  Correlation
0            air_time11           total_time11     0.998010
1         gmrt_in_air11            mean_gmrt11     0.872288
2         gmrt_in_air11    mean_speed_in_air11     0.817690
3       gmrt_on_paper11    mean_acc_on_paper11     0.739347
4       gmrt_on_paper11            mean_gmrt11     0.843760
5       gmrt_on_paper11  mean_speed_on_paper11     0.981935
6     mean_acc_in_air11     mean_jerk_in_air11     0.996682
7   mean_acc_on_paper11   mean_jerk_on_paper11     0.856793
8   mean_acc_on_paper11  mean_speed_on_paper11     0.755491
9           mean_gmrt11    mean_speed_in_air11     0.726196
10          mean_gmrt11  mean_speed_on_paper11     0.843252


Highly Correlated Feature Pairs in Task12:
              Feature1               Feature2  Correlation
0           air_time12           total_time12     0.996805
1        gmrt_in_air12            mean_gmrt12     0.982338
2      gmrt_on_paper12  mean_speed_on_paper12     0.989997
3    mean_acc_in_air12     mean_jerk_in_air12     0.988375
4  mean_acc_on_paper12   mean_jerk_on_paper12     0.857532


Highly Correlated Feature Pairs in Task13:
              Feature1               Feature2  Correlation
0           air_time13           total_time13     0.913617
1        gmrt_in_air13            mean_gmrt13     0.963367
2        gmrt_in_air13    mean_speed_in_air13     0.749360
3      gmrt_on_paper13  mean_speed_on_paper13     0.983036
4    mean_acc_in_air13     mean_jerk_in_air13     0.997881
5    mean_acc_in_air13    mean_speed_in_air13     0.777862
6  mean_acc_on_paper13   mean_jerk_on_paper13     0.804902
7          mean_gmrt13    mean_speed_in_air13     0.748023
8   mean_jerk_in_air13    mean_speed_in_air13     0.766715
9         paper_time13           total_time13     0.838600


Highly Correlated Feature Pairs in Task14:
               Feature1               Feature2  Correlation
0            air_time14           total_time14     0.999728
1          disp_index14           paper_time14     0.756335
2         gmrt_in_air14        gmrt_on_paper14     0.700201
3         gmrt_in_air14            mean_gmrt14     0.957898
4         gmrt_in_air14    mean_speed_in_air14     0.951554
5         gmrt_in_air14  mean_speed_on_paper14     0.705478
6       gmrt_on_paper14      max_x_extension14     0.726850
7       gmrt_on_paper14    mean_acc_on_paper14     0.714079
8       gmrt_on_paper14            mean_gmrt14     0.875701
9       gmrt_on_paper14    mean_speed_in_air14     0.717726
10      gmrt_on_paper14  mean_speed_on_paper14     0.968376
11    mean_acc_in_air14     mean_jerk_in_air14     0.998935
12    mean_acc_in_air14    mean_speed_in_air14     0.706889
13  mean_acc_on_paper14   mean_jerk_on_paper14     0.890177
14  mean_acc_on_paper14  mean_speed_on_paper14     0.743337
15          mean_gmrt14    mean_speed_in_air14     0.932180
16          mean_gmrt14  mean_speed_on_paper14     0.866553
17  mean_speed_in_air14  mean_speed_on_paper14     0.707248


Highly Correlated Feature Pairs in Task15:
                Feature1               Feature2  Correlation
0             air_time15           total_time15     0.995795
1           disp_index15           paper_time15     0.739648
2          gmrt_in_air15            mean_gmrt15     0.812303
3          gmrt_in_air15    mean_speed_in_air15     0.952605
4        gmrt_on_paper15    mean_acc_on_paper15     0.727283
5        gmrt_on_paper15            mean_gmrt15     0.839909
6        gmrt_on_paper15   mean_jerk_on_paper15     0.761560
7        gmrt_on_paper15  mean_speed_on_paper15     0.955322
8      mean_acc_in_air15     mean_jerk_in_air15     0.998358
9      mean_acc_in_air15    mean_speed_in_air15     0.733157
10   mean_acc_on_paper15   mean_jerk_on_paper15     0.887051
11   mean_acc_on_paper15  mean_speed_on_paper15     0.752111
12           mean_gmrt15    mean_speed_in_air15     0.764917
13           mean_gmrt15  mean_speed_on_paper15     0.802560
14    mean_jerk_in_air15    mean_speed_in_air15     0.720893
15  mean_jerk_on_paper15  mean_speed_on_paper15     0.769756


Highly Correlated Feature Pairs in Task16:
               Feature1               Feature2  Correlation
0            air_time16           total_time16     0.974063
1          disp_index16      max_x_extension16     0.919359
2          disp_index16      max_y_extension16     0.844919
3          disp_index16       num_of_pendown16     0.791532
4          disp_index16           paper_time16     0.803024
5         gmrt_in_air16            mean_gmrt16     0.903230
6         gmrt_in_air16    mean_speed_in_air16     0.824760
7       gmrt_on_paper16  mean_speed_on_paper16     0.918512
8     max_x_extension16      max_y_extension16     0.827823
9     max_x_extension16       num_of_pendown16     0.765683
10    max_y_extension16       num_of_pendown16     0.751162
11    mean_acc_in_air16     mean_jerk_in_air16     0.998580
12    mean_acc_in_air16    mean_speed_in_air16     0.723327
13  mean_acc_on_paper16   mean_jerk_on_paper16     0.853240
14          mean_gmrt16    mean_speed_in_air16     0.750478
15   mean_jerk_in_air16    mean_speed_in_air16     0.721484
16     num_of_pendown16           paper_time16     0.728595
17         paper_time16           total_time16     0.750774


Highly Correlated Feature Pairs in Task17:
               Feature1               Feature2  Correlation
0            air_time17           total_time17     0.991785
1         gmrt_in_air17      mean_acc_in_air17     0.920234
2         gmrt_in_air17            mean_gmrt17     0.961986
3         gmrt_in_air17     mean_jerk_in_air17     0.918957
4         gmrt_in_air17    mean_speed_in_air17     0.990772
5       gmrt_on_paper17            mean_gmrt17     0.802489
6       gmrt_on_paper17  mean_speed_on_paper17     0.949224
7     max_x_extension17      max_y_extension17     0.836344
8     mean_acc_in_air17            mean_gmrt17     0.881192
9     mean_acc_in_air17     mean_jerk_in_air17     0.999833
10    mean_acc_in_air17    mean_speed_in_air17     0.924006
11  mean_acc_on_paper17   mean_jerk_on_paper17     0.890039
12          mean_gmrt17     mean_jerk_in_air17     0.880017
13          mean_gmrt17    mean_speed_in_air17     0.965667
14          mean_gmrt17  mean_speed_on_paper17     0.772327
15   mean_jerk_in_air17    mean_speed_in_air17     0.922295


Highly Correlated Feature Pairs in Task18:
               Feature1               Feature2  Correlation
0            air_time18           disp_index18     0.808225
1            air_time18       num_of_pendown18     0.924399
2            air_time18           paper_time18     0.848103
3            air_time18           total_time18     0.985891
4          disp_index18      max_x_extension18     0.811118
5          disp_index18       num_of_pendown18     0.894879
6          disp_index18           paper_time18     0.877213
7          disp_index18           total_time18     0.857403
8         gmrt_in_air18            mean_gmrt18     0.938460
9         gmrt_in_air18    mean_speed_in_air18     0.925410
10      gmrt_on_paper18    mean_acc_on_paper18     0.727088
11      gmrt_on_paper18            mean_gmrt18     0.716782
12      gmrt_on_paper18  mean_speed_on_paper18     0.960014
13    max_x_extension18      max_y_extension18     0.769647
14    max_x_extension18       num_of_pendown18     0.724847
15    mean_acc_in_air18     mean_jerk_in_air18     0.996263
16  mean_acc_on_paper18   mean_jerk_on_paper18     0.842544
17  mean_acc_on_paper18  mean_speed_on_paper18     0.755078
18          mean_gmrt18    mean_speed_in_air18     0.878387
19          mean_gmrt18  mean_speed_on_paper18     0.751618
20     num_of_pendown18           paper_time18     0.886797
21     num_of_pendown18           total_time18     0.943838
22         paper_time18           total_time18     0.924825


Highly Correlated Feature Pairs in Task19:
               Feature1               Feature2  Correlation
0            air_time19           total_time19     0.999992
1         gmrt_in_air19      mean_acc_in_air19     0.741345
2         gmrt_in_air19            mean_gmrt19     0.875684
3         gmrt_in_air19     mean_jerk_in_air19     0.731213
4         gmrt_in_air19    mean_speed_in_air19     0.986371
5       gmrt_on_paper19            mean_gmrt19     0.852113
6       gmrt_on_paper19  mean_speed_on_paper19     0.963996
7     mean_acc_in_air19     mean_jerk_in_air19     0.998665
8     mean_acc_in_air19    mean_speed_in_air19     0.729755
9   mean_acc_on_paper19   mean_jerk_on_paper19     0.854577
10          mean_gmrt19    mean_speed_in_air19     0.841253
11          mean_gmrt19  mean_speed_on_paper19     0.826229
12   mean_jerk_in_air19    mean_speed_in_air19     0.716694


Highly Correlated Feature Pairs in Task20:
              Feature1               Feature2  Correlation
0           air_time20           total_time20     0.978060
1         disp_index20           paper_time20     0.775212
2        gmrt_in_air20            mean_gmrt20     0.923186
3        gmrt_in_air20    mean_speed_in_air20     0.901774
4      gmrt_on_paper20            mean_gmrt20     0.753344
5      gmrt_on_paper20  mean_speed_on_paper20     0.973751
6    mean_acc_in_air20     mean_jerk_in_air20     0.996749
7  mean_acc_on_paper20   mean_jerk_on_paper20     0.873658
8          mean_gmrt20    mean_speed_in_air20     0.882880
9          mean_gmrt20  mean_speed_on_paper20     0.775548


Highly Correlated Feature Pairs in Task21:
               Feature1               Feature2  Correlation
0            air_time21       num_of_pendown21     0.815425
1            air_time21           total_time21     0.844409
2          disp_index21           paper_time21     0.791304
3         gmrt_in_air21            mean_gmrt21     0.988968
4       gmrt_on_paper21    mean_acc_on_paper21    -0.725525
5       gmrt_on_paper21  mean_speed_on_paper21     0.982040
6     max_x_extension21      max_y_extension21     0.861927
7     mean_acc_in_air21     mean_jerk_in_air21     0.999764
8     mean_acc_in_air21    mean_speed_in_air21     0.936375
9   mean_acc_on_paper21   mean_jerk_on_paper21     0.905703
10  mean_acc_on_paper21  mean_speed_on_paper21    -0.736360
11   mean_jerk_in_air21    mean_speed_in_air21     0.934035
12     num_of_pendown21           paper_time21     0.703650
13     num_of_pendown21           total_time21     0.829889
14         paper_time21           total_time21     0.937578


Highly Correlated Feature Pairs in Task22:
              Feature1               Feature2  Correlation
0           air_time22           total_time22     0.999412
1        gmrt_in_air22            mean_gmrt22     0.903649
2        gmrt_in_air22    mean_speed_in_air22     0.961481
3      gmrt_on_paper22            mean_gmrt22     0.902864
4      gmrt_on_paper22  mean_speed_on_paper22     0.978226
5    mean_acc_in_air22     mean_jerk_in_air22     0.992637
6  mean_acc_on_paper22   mean_jerk_on_paper22     0.869012
7          mean_gmrt22    mean_speed_in_air22     0.880084
8          mean_gmrt22  mean_speed_on_paper22     0.884698


Highly Correlated Feature Pairs in Task23:
              Feature1               Feature2  Correlation
0           air_time23           total_time23     0.988343
1        gmrt_in_air23            mean_gmrt23     0.874589
2        gmrt_in_air23    mean_speed_in_air23     0.974215
3      gmrt_on_paper23            mean_gmrt23     0.919647
4      gmrt_on_paper23  mean_speed_on_paper23     0.971879
5    mean_acc_in_air23     mean_jerk_in_air23     0.991724
6  mean_acc_on_paper23   mean_jerk_on_paper23     0.850627
7  mean_acc_on_paper23  mean_speed_on_paper23     0.719526
8          mean_gmrt23    mean_speed_in_air23     0.831196
9          mean_gmrt23  mean_speed_on_paper23     0.907974


Highly Correlated Feature Pairs in Task24:
               Feature1               Feature2  Correlation
0            air_time24           total_time24     0.987239
1         gmrt_in_air24        gmrt_on_paper24     0.775761
2         gmrt_in_air24      mean_acc_in_air24     0.807689
3         gmrt_in_air24            mean_gmrt24     0.930178
4         gmrt_in_air24     mean_jerk_in_air24     0.802715
5         gmrt_in_air24    mean_speed_in_air24     0.983763
6       gmrt_on_paper24            mean_gmrt24     0.953251
7       gmrt_on_paper24    mean_speed_in_air24     0.731777
8       gmrt_on_paper24  mean_speed_on_paper24     0.923057
9     max_x_extension24      max_y_extension24     0.932840
10    mean_acc_in_air24            mean_gmrt24     0.727239
11    mean_acc_in_air24     mean_jerk_in_air24     0.999320
12    mean_acc_in_air24    mean_speed_in_air24     0.820703
13  mean_acc_on_paper24   mean_jerk_on_paper24     0.893046
14          mean_gmrt24     mean_jerk_in_air24     0.724329
15          mean_gmrt24    mean_speed_in_air24     0.896815
16          mean_gmrt24  mean_speed_on_paper24     0.871719
17   mean_jerk_in_air24    mean_speed_in_air24     0.813987


Highly Correlated Feature Pairs in Task25:
               Feature1               Feature2  Correlation
0            air_time25           total_time25     0.999294
1          disp_index25           paper_time25     0.703494
2         gmrt_in_air25      mean_acc_in_air25     0.872965
3         gmrt_in_air25            mean_gmrt25     0.949397
4         gmrt_in_air25     mean_jerk_in_air25     0.871563
5         gmrt_in_air25    mean_speed_in_air25     0.984926
6         gmrt_in_air25  mean_speed_on_paper25     0.708510
7       gmrt_on_paper25            mean_gmrt25     0.879144
8       gmrt_on_paper25  mean_speed_on_paper25     0.964624
9     mean_acc_in_air25            mean_gmrt25     0.839432
10    mean_acc_in_air25     mean_jerk_in_air25     0.999427
11    mean_acc_in_air25    mean_speed_in_air25     0.869102
12  mean_acc_on_paper25   mean_jerk_on_paper25     0.891626
13          mean_gmrt25     mean_jerk_in_air25     0.838844
14          mean_gmrt25    mean_speed_in_air25     0.943679
15          mean_gmrt25  mean_speed_on_paper25     0.879285
16   mean_jerk_in_air25    mean_speed_in_air25     0.864771
17  mean_speed_in_air25  mean_speed_on_paper25     0.718706


Similar Pairs Across Different Tasks:¶
1. air_time & total_time:
    air_time is the time the pen spends moving in the air (off the tablet surface) and total_time is the time taken to complete the whole task. In-air time is a component of total time, so the two move together. This pair exceeds 0.7 in every one of the 25 tasks.

2. disp_index & paper_time:
    disp_index (dispersion index) describes how widely the handwriting is spread over the sheet, and paper_time is the time the pen spends on the surface. A more widely spread trace plausibly takes longer to write.

3. gmrt_in_air & mean_gmrt:
    GMRT stands for Generalization of the Mean Relative Tremor, a tremor measure from the DARWIN feature set (it is not a response time). mean_gmrt is the average of gmrt_in_air and gmrt_on_paper, so it correlates with its components.

4. gmrt_on_paper & mean_speed_on_paper:
    Both are computed from on-paper pen movement. The consistently high correlation suggests that the two carry largely the same information.

5. mean_acc_in_air & mean_jerk_in_air:
    Mean acceleration and mean jerk of in-air movements. Jerk is the rate of change of acceleration, so both derive from the same movement signal; the pair is almost perfectly correlated (r > 0.96 in every task).

6. mean_acc_on_paper & mean_jerk_on_paper:
    The same acceleration and jerk pair, measured while the pen is on the surface.
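
To see which pairs recur, the task number can be stripped from each feature name and the resulting base pairs counted across tasks. This is a minimal sketch, assuming a dict shaped like `high_corr_pairs_dict` from the cell above; the helper `count_recurring_pairs` is hypothetical, and the two-task sample copies a few rows from the Task 1 and Task 2 output.

```python
import re
from collections import Counter

import pandas as pd

def count_recurring_pairs(pairs_by_task: dict) -> Counter:
    """Count in how many tasks each base (feature, feature) pair appears."""
    counts = Counter()
    for pairs in pairs_by_task.values():
        for f1, f2 in zip(pairs["Feature1"], pairs["Feature2"]):
            # Remove the trailing task number, e.g. "air_time12" -> "air_time".
            base = tuple(re.sub(r"\d+$", "", name) for name in (f1, f2))
            counts[base] += 1
    return counts

sample = {
    "Task1": pd.DataFrame({
        "Feature1": ["air_time1", "disp_index1", "num_of_pendown1"],
        "Feature2": ["total_time1", "paper_time1", "paper_time1"],
        "Correlation": [0.972902, 0.808375, 0.726058],
    }),
    "Task2": pd.DataFrame({
        "Feature1": ["air_time2", "disp_index2", "air_time2"],
        "Feature2": ["total_time2", "paper_time2", "num_of_pendown2"],
        "Correlation": [0.964098, 0.767908, 0.878915],
    }),
}
recurring = count_recurring_pairs(sample)
for pair, n_tasks in recurring.most_common():
    print(pair, n_tasks)
```

Run on the full dictionary, pairs with a count near 25 are the ones that are redundant in almost every task and are natural candidates for dropping one member.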

Run VIF for each task individually¶
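
For reference, the variance inflation factor of feature i is VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the other features. If a feature is an exact linear combination of the others, R_i^2 equals 1 and the VIF is infinite, which surfaces as a divide-by-zero warning. The NumPy-only sketch below illustrates this on made-up data in which the third column is constructed as `air_time + paper_time`; whether the DARWIN columns satisfy that identity exactly is an assumption, and `vif_manual` is a hypothetical helper, not the statsmodels function.

```python
import numpy as np

def vif_manual(X: np.ndarray, i: int) -> float:
    """VIF of column i: regress it on the remaining columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    r_squared = 1.0 - residual.var() / y.var()
    # Report infinity instead of dividing by zero when the fit is numerically perfect.
    return float("inf") if np.isclose(r_squared, 1.0) else 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(2)
air_time = rng.uniform(1, 10, size=100)
paper_time = rng.uniform(1, 10, size=100)
noise = rng.normal(size=100)

# Three unrelated columns versus a set where the last column is an exact sum.
independent = np.column_stack([air_time, paper_time, noise])
collinear = np.column_stack([air_time, paper_time, air_time + paper_time])

vif_independent = vif_manual(independent, 0)
vif_collinear = vif_manual(collinear, 2)
print(vif_independent, vif_collinear)
```

In the collinear case every column involved in the exact relationship gets an infinite VIF, so dropping any one of them is enough to make the remaining values finite.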

In [22]:
vif_results = {}

# Iterate through each task
for task_name, task_data in tasks.items():
    # Add constant to the task data
    X_num = sm.add_constant(task_data)
    
    # Calculate VIF
    vif = pd.DataFrame()
    vif["Features"] = X_num.columns
    vif["VIF"] = [variance_inflation_factor(X_num.values, i) for i in range(X_num.shape[1])]
    
    # Store VIF results for the current task
    vif_results[task_name] = vif

# Print or use vif_results as needed
for task_name, vif_result in vif_results.items():
    print(f"VIF for {task_name}:")
    print(vif_result)
    print("\n")
//anaconda3/lib/python3.11/site-packages/statsmodels/stats/outliers_influence.py:198: RuntimeWarning: divide by zero encountered in scalar divide
  vif = 1. / (1. - r_squared_i)
VIF for Task1:
                Features           VIF
0                  const  1.105667e+02
1              air_time1  3.753000e+14
2            disp_index1  7.928363e+00
3           gmrt_in_air1           inf
4         gmrt_on_paper1           inf
5       max_x_extension1  1.324342e+00
6       max_y_extension1  3.573801e+00
7       mean_acc_in_air1  8.697292e+01
8     mean_acc_on_paper1  1.549472e+01
9             mean_gmrt1           inf
10     mean_jerk_in_air1  7.490932e+01
11   mean_jerk_on_paper1  6.111167e+00
12    mean_speed_in_air1  8.744612e+00
13  mean_speed_on_paper1  6.005949e+01
14       num_of_pendown1  3.056514e+00
15           paper_time1  1.841963e+13
16        pressure_mean1  1.344270e+00
17         pressure_var1  1.529629e+00
18           total_time1  1.286743e+15


VIF for Task2:
                Features           VIF
0                  const  9.704804e+01
1              air_time2  1.622919e+13
2            disp_index2  4.706975e+00
3           gmrt_in_air2           inf
4         gmrt_on_paper2           inf
5       max_x_extension2  1.725799e+00
6       max_y_extension2  1.393634e+00
7       mean_acc_in_air2  1.100280e+02
8     mean_acc_on_paper2  5.180047e+00
9             mean_gmrt2           inf
10     mean_jerk_in_air2  1.080608e+02
11   mean_jerk_on_paper2  4.420443e+00
12    mean_speed_in_air2  2.963337e+00
13  mean_speed_on_paper2  9.264403e+00
14       num_of_pendown2  6.865570e+00
15           paper_time2  4.816684e+13
16        pressure_mean2  1.433557e+00
17         pressure_var2  2.071652e+00
18           total_time2  2.918730e+12


VIF for Task3:
                Features           VIF
0                  const  6.880673e+01
1              air_time3  9.790434e+13
2            disp_index3  5.629994e+00
3           gmrt_in_air3           inf
4         gmrt_on_paper3           inf
5       max_x_extension3  1.885585e+00
6       max_y_extension3  1.690008e+00
7       mean_acc_in_air3  1.892925e+02
8     mean_acc_on_paper3  5.315598e+00
9             mean_gmrt3           inf
10     mean_jerk_in_air3  1.833855e+02
11   mean_jerk_on_paper3  3.959730e+00
12    mean_speed_in_air3  2.978821e+00
13  mean_speed_on_paper3  1.198944e+01
14       num_of_pendown3  4.341671e+00
15           paper_time3  1.094435e+13
16        pressure_mean3  1.836625e+00
17         pressure_var3  2.275786e+00
18           total_time3  2.001600e+14


VIF for Task4:
                Features           VIF
0                  const  2.949786e+02
1              air_time4  4.094181e+14
2            disp_index4  5.336681e+00
3           gmrt_in_air4           inf
4         gmrt_on_paper4           inf
5       max_x_extension4  3.812092e+00
6       max_y_extension4  3.434112e+00
7       mean_acc_in_air4  1.598819e+02
8     mean_acc_on_paper4  5.585926e+00
9             mean_gmrt4           inf
10     mean_jerk_in_air4  1.477640e+02
11   mean_jerk_on_paper4  5.572271e+00
12    mean_speed_in_air4  5.820308e+00
13  mean_speed_on_paper4  8.843293e+01
14       num_of_pendown4  2.924574e+00
15           paper_time4  3.216857e+14
16        pressure_mean4  1.710513e+00
17         pressure_var4  2.185169e+00
18           total_time4  7.505999e+14


VIF for Task5:
                Features           VIF
0                  const  8.687105e+01
1              air_time5  1.272203e+13
2            disp_index5  7.741576e+00
3           gmrt_in_air5           inf
4         gmrt_on_paper5           inf
5       max_x_extension5  4.658236e+00
6       max_y_extension5  8.930549e+00
7       mean_acc_in_air5  2.831899e+01
8     mean_acc_on_paper5  5.299003e+00
9             mean_gmrt5           inf
10     mean_jerk_in_air5  2.399312e+01
11   mean_jerk_on_paper5  5.513568e+00
12    mean_speed_in_air5  3.830815e+00
13  mean_speed_on_paper5  1.211562e+02
14       num_of_pendown5  2.841678e+00
15           paper_time5  2.850380e+13
16        pressure_mean5  1.798498e+00
17         pressure_var5  1.807462e+00
18           total_time5  1.983965e+13


VIF for Task6:
                Features           VIF
0                  const  9.853068e+01
1              air_time6           inf
2            disp_index6  4.241558e+00
3           gmrt_in_air6           inf
4         gmrt_on_paper6           inf
5       max_x_extension6  1.348731e+00
6       max_y_extension6  1.645329e+00
7       mean_acc_in_air6  1.155542e+03
8     mean_acc_on_paper6  5.847207e+00
9             mean_gmrt6           inf
10     mean_jerk_in_air6  1.142259e+03
11   mean_jerk_on_paper6  5.108992e+00
12    mean_speed_in_air6  1.890550e+01
13  mean_speed_on_paper6  3.247801e+01
14       num_of_pendown6  3.598965e+00
15           paper_time6  2.573486e+14
16        pressure_mean6  1.255557e+00
17         pressure_var6  1.377617e+00
18           total_time6  3.768703e+13


VIF for Task7:
                Features           VIF
0                  const  8.494999e+02
1              air_time7  6.928615e+14
2            disp_index7  6.974183e+00
3           gmrt_in_air7           inf
4         gmrt_on_paper7           inf
5       max_x_extension7  1.610340e+00
6       max_y_extension7  1.372123e+00
7       mean_acc_in_air7  1.839324e+03
8     mean_acc_on_paper7  6.349595e+00
9             mean_gmrt7           inf
10     mean_jerk_in_air7  1.801752e+03
11   mean_jerk_on_paper7  6.963139e+00
12    mean_speed_in_air7  6.869568e+01
13  mean_speed_on_paper7  2.106457e+01
14       num_of_pendown7  2.280263e+00
15           paper_time7  2.037828e+13
16        pressure_mean7  1.290672e+00
17         pressure_var7  1.295256e+00
18           total_time7  2.537239e+13


VIF for Task8:
                Features           VIF
0                  const  1.245849e+02
1              air_time8  2.370316e+14
2            disp_index8  7.365519e+00
3           gmrt_in_air8           inf
4         gmrt_on_paper8           inf
5       max_x_extension8  2.881130e+00
6       max_y_extension8  3.467209e+00
7       mean_acc_in_air8  1.427206e+02
8     mean_acc_on_paper8  3.798652e+00
9             mean_gmrt8           inf
10     mean_jerk_in_air8  1.405687e+02
11   mean_jerk_on_paper8  4.111470e+00
12    mean_speed_in_air8  3.143658e+00
13  mean_speed_on_paper8  2.348250e+02
14       num_of_pendown8  5.805948e+00
15           paper_time8  2.914951e+13
16        pressure_mean8  1.417820e+00
17         pressure_var8  1.801673e+00
18           total_time8  3.216857e+14


VIF for Task9:
                Features           VIF
0                  const  1.652827e+02
1              air_time9  1.358552e+13
2            disp_index9  6.093110e+00
3           gmrt_in_air9           inf
4         gmrt_on_paper9           inf
5       max_x_extension9  3.770112e+00
6       max_y_extension9  4.416594e+00
7       mean_acc_in_air9  4.141213e+02
8     mean_acc_on_paper9  3.375279e+00
9             mean_gmrt9           inf
10     mean_jerk_in_air9  4.021643e+02
11   mean_jerk_on_paper9  3.905194e+00
12    mean_speed_in_air9  4.085624e+00
13  mean_speed_on_paper9  1.540690e+02
14       num_of_pendown9  6.053427e+00
15           paper_time9  3.216857e+14
16        pressure_mean9  1.914346e+00
17         pressure_var9  2.644888e+00
18           total_time9  4.503600e+14


VIF for Task10:
                 Features           VIF
0                   const  1.182665e+02
1              air_time10  8.578285e+13
2            disp_index10  4.446180e+00
3           gmrt_in_air10           inf
4         gmrt_on_paper10           inf
5       max_x_extension10  4.187501e+00
6       max_y_extension10  3.759179e+00
7       mean_acc_in_air10  8.702232e+01
8     mean_acc_on_paper10  9.031832e+00
9             mean_gmrt10           inf
10     mean_jerk_in_air10  8.350087e+01
11   mean_jerk_on_paper10  5.244241e+00
12    mean_speed_in_air10  5.492209e+00
13  mean_speed_on_paper10  8.183551e+01
14       num_of_pendown10  4.846591e+00
15           paper_time10  3.102721e+12
16        pressure_mean10  1.469359e+00
17         pressure_var10  2.000801e+00
18           total_time10  1.344358e+14


VIF for Task11:
                 Features           VIF
0                   const  1.521792e+02
1              air_time11  1.286743e+15
2            disp_index11  3.094205e+00
3           gmrt_in_air11           inf
4         gmrt_on_paper11           inf
5       max_x_extension11  1.799007e+00
6       max_y_extension11  2.801996e+00
7       mean_acc_in_air11  1.813071e+02
8     mean_acc_on_paper11  7.701560e+00
9             mean_gmrt11           inf
10     mean_jerk_in_air11  1.738292e+02
11   mean_jerk_on_paper11  5.456851e+00
12    mean_speed_in_air11  5.067461e+00
13  mean_speed_on_paper11  6.059699e+01
14       num_of_pendown11  2.907528e+00
15           paper_time11  3.721983e+13
16        pressure_mean11  1.358360e+00
17         pressure_var11  1.424200e+00
18           total_time11  4.503600e+15


VIF for Task12:
                 Features           VIF
0                   const  8.049717e+01
1              air_time12  1.125900e+15
2            disp_index12  5.447076e+00
3           gmrt_in_air12           inf
4         gmrt_on_paper12           inf
5       max_x_extension12  3.211526e+00
6       max_y_extension12  5.092104e+00
7       mean_acc_in_air12  6.246471e+01
8     mean_acc_on_paper12  5.418308e+00
9             mean_gmrt12           inf
10     mean_jerk_in_air12  5.732546e+01
11   mean_jerk_on_paper12  6.272319e+00
12    mean_speed_in_air12  3.686338e+00
13  mean_speed_on_paper12  9.989373e+01
14       num_of_pendown12  3.089404e+00
15           paper_time12  2.434378e+14
16        pressure_mean12  1.563005e+00
17         pressure_var12  1.832552e+00
18           total_time12  1.668000e+14


VIF for Task13:
                 Features           VIF
0                   const  1.584171e+02
1              air_time13  4.228732e+13
2            disp_index13  4.904570e+00
3           gmrt_in_air13           inf
4         gmrt_on_paper13           inf
5       max_x_extension13  3.491735e+00
6       max_y_extension13  4.665869e+00
7       mean_acc_in_air13  3.003368e+02
8     mean_acc_on_paper13  4.274424e+00
9             mean_gmrt13           inf
10     mean_jerk_in_air13  2.879399e+02
11   mean_jerk_on_paper13  4.966804e+00
12    mean_speed_in_air13  6.045759e+00
13  mean_speed_on_paper13  1.167574e+02
14       num_of_pendown13  3.887106e+00
15           paper_time13  1.286743e+14
16        pressure_mean13  1.330342e+00
17         pressure_var13  1.794759e+00
18           total_time13  1.073564e+13


VIF for Task14:
                 Features           VIF
0                   const  8.094768e+01
1              air_time14           inf
2            disp_index14  6.825483e+00
3           gmrt_in_air14           inf
4         gmrt_on_paper14           inf
5       max_x_extension14  1.089854e+01
6       max_y_extension14  4.578782e+00
7       mean_acc_in_air14  6.499061e+02
8     mean_acc_on_paper14  1.098977e+01
9             mean_gmrt14           inf
10     mean_jerk_in_air14  6.298104e+02
11   mean_jerk_on_paper14  8.400442e+00
12    mean_speed_in_air14  1.725920e+01
13  mean_speed_on_paper14  4.798802e+01
14       num_of_pendown14  3.901245e+00
15           paper_time14  2.943529e+13
16        pressure_mean14  1.586477e+00
17         pressure_var14  1.672349e+00
18           total_time14  1.501200e+15


VIF for Task15:
                 Features           VIF
0                   const  1.323029e+02
1              air_time15  3.464307e+14
2            disp_index15  4.014468e+00
3           gmrt_in_air15           inf
4         gmrt_on_paper15           inf
5       max_x_extension15  2.624357e+00
6       max_y_extension15  3.753577e+00
7       mean_acc_in_air15  4.105828e+02
8     mean_acc_on_paper15  6.274418e+00
9             mean_gmrt15           inf
10     mean_jerk_in_air15  3.979800e+02
11   mean_jerk_on_paper15  7.844200e+00
12    mean_speed_in_air15  1.940283e+01
13  mean_speed_on_paper15  2.212481e+01
14       num_of_pendown15  3.149710e+00
15           paper_time15  7.832347e+13
16        pressure_mean15  1.312124e+00
17         pressure_var15  1.336161e+00
18           total_time15  1.000800e+15


VIF for Task16:
                 Features           VIF
0                   const  7.712815e+01
1              air_time16  1.544974e+13
2            disp_index16  1.857458e+01
3           gmrt_in_air16           inf
4         gmrt_on_paper16           inf
5       max_x_extension16  1.386369e+01
6       max_y_extension16  7.890255e+00
7       mean_acc_in_air16  4.199157e+02
8     mean_acc_on_paper16  5.057406e+00
9             mean_gmrt16           inf
10     mean_jerk_in_air16  4.164022e+02
11   mean_jerk_on_paper16  4.521262e+00
12    mean_speed_in_air16  6.617539e+00
13  mean_speed_on_paper16  1.551746e+01
14       num_of_pendown16  5.596677e+00
15           paper_time16  1.407375e+14
16        pressure_mean16  1.369743e+00
17         pressure_var16  1.484131e+00
18           total_time16  3.832851e+13


VIF for Task17:
                 Features           VIF
0                   const  2.832389e+02
1              air_time17  3.105931e+14
2            disp_index17  8.301257e+00
3           gmrt_in_air17           inf
4         gmrt_on_paper17           inf
5       max_x_extension17  5.030863e+00
6       max_y_extension17  6.539413e+00
7       mean_acc_in_air17  3.577205e+03
8     mean_acc_on_paper17  7.362602e+00
9             mean_gmrt17           inf
10     mean_jerk_in_air17  3.511993e+03
11   mean_jerk_on_paper17  8.268067e+00
12    mean_speed_in_air17  9.230383e+01
13  mean_speed_on_paper17  2.514241e+01
14       num_of_pendown17  5.104453e+00
15           paper_time17  4.647678e+12
16        pressure_mean17  1.570213e+00
17         pressure_var17  1.806342e+00
18           total_time17  1.507986e+12


VIF for Task18:
                 Features           VIF
0                   const  7.116426e+01
1              air_time18  4.003200e+13
2            disp_index18  1.309099e+01
3           gmrt_in_air18           inf
4         gmrt_on_paper18           inf
5       max_x_extension18  5.855448e+00
6       max_y_extension18  4.270972e+00
7       mean_acc_in_air18  1.845349e+02
8     mean_acc_on_paper18  8.385470e+00
9             mean_gmrt18           inf
10     mean_jerk_in_air18  1.734955e+02
11   mean_jerk_on_paper18  5.107061e+00
12    mean_speed_in_air18  1.128941e+01
13  mean_speed_on_paper18  2.706164e+01
14       num_of_pendown18  1.867973e+01
15           paper_time18  2.309538e+14
16        pressure_mean18  1.400963e+00
17         pressure_var18  1.394921e+00
18           total_time18  6.928615e+13


VIF for Task19:
                 Features           VIF
0                   const  3.639880e+02
1              air_time19           inf
2            disp_index19  5.236016e+00
3           gmrt_in_air19           inf
4         gmrt_on_paper19           inf
5       max_x_extension19  1.962953e+00
6       max_y_extension19  1.640485e+00
7       mean_acc_in_air19  5.289859e+02
8     mean_acc_on_paper19  7.604564e+00
9             mean_gmrt19           inf
10     mean_jerk_in_air19  5.137661e+02
11   mean_jerk_on_paper19  1.019701e+01
12    mean_speed_in_air19  1.199893e+02
13  mean_speed_on_paper19  2.593118e+01
14       num_of_pendown19  4.254184e+00
15           paper_time19  1.324588e+14
16        pressure_mean19  1.895993e+00
17         pressure_var19  1.746712e+00
18           total_time19           inf


VIF for Task20:
                 Features           VIF
0                   const  1.682869e+02
1              air_time20  4.094181e+14
2            disp_index20  7.056759e+00
3           gmrt_in_air20           inf
4         gmrt_on_paper20           inf
5       max_x_extension20  3.484185e+00
6       max_y_extension20  2.816101e+00
7       mean_acc_in_air20  2.397273e+02
8     mean_acc_on_paper20  7.461168e+00
9             mean_gmrt20           inf
10     mean_jerk_in_air20  2.278487e+02
11   mean_jerk_on_paper20  7.019694e+00
12    mean_speed_in_air20  1.718689e+01
13  mean_speed_on_paper20  3.695920e+01
14       num_of_pendown20  5.265015e+00
15           paper_time20  2.251800e+14
16        pressure_mean20  1.698360e+00
17         pressure_var20  1.843513e+00
18           total_time20  1.233863e+14


VIF for Task21:
                 Features           VIF
0                   const  3.580011e+02
1              air_time21  6.721790e+13
2            disp_index21  1.274099e+01
3           gmrt_in_air21           inf
4         gmrt_on_paper21           inf
5       max_x_extension21  5.584915e+00
6       max_y_extension21  5.399551e+00
7       mean_acc_in_air21  2.848989e+03
8     mean_acc_on_paper21  1.029701e+01
9             mean_gmrt21           inf
10     mean_jerk_in_air21  2.741564e+03
11   mean_jerk_on_paper21  9.135640e+00
12    mean_speed_in_air21  1.159173e+01
13  mean_speed_on_paper21  5.017173e+01
14       num_of_pendown21  5.968672e+00
15           paper_time21  2.690322e+12
16        pressure_mean21  2.499224e+00
17         pressure_var21  3.732340e+00
18           total_time21  1.941207e+13


VIF for Task22:
                 Features           VIF
0                   const  3.204558e+02
1              air_time22           inf
2            disp_index22  5.860487e+00
3           gmrt_in_air22           inf
4         gmrt_on_paper22           inf
5       max_x_extension22  2.872050e+00
6       max_y_extension22  2.671489e+00
7       mean_acc_in_air22  7.719101e+01
8     mean_acc_on_paper22  6.582674e+00
9             mean_gmrt22           inf
10     mean_jerk_in_air22  7.629409e+01
11   mean_jerk_on_paper22  7.539692e+00
12    mean_speed_in_air22  2.115548e+01
13  mean_speed_on_paper22  3.407063e+01
14       num_of_pendown22  2.417024e+00
15           paper_time22  7.970973e+13
16        pressure_mean22  1.533381e+00
17         pressure_var22  1.530904e+00
18           total_time22           inf


VIF for Task23:
                 Features           VIF
0                   const  2.562240e+02
1              air_time23  8.188363e+14
2            disp_index23  3.804280e+00
3           gmrt_in_air23           inf
4         gmrt_on_paper23           inf
5       max_x_extension23  2.540148e+00
6       max_y_extension23  3.855874e+00
7       mean_acc_in_air23  7.356635e+01
8     mean_acc_on_paper23  7.190413e+00
9             mean_gmrt23           inf
10     mean_jerk_in_air23  7.130591e+01
11   mean_jerk_on_paper23  6.147819e+00
12    mean_speed_in_air23  2.879408e+01
13  mean_speed_on_paper23  3.586943e+01
14       num_of_pendown23  2.098031e+00
15           paper_time23  2.047091e+13
16        pressure_mean23  1.373074e+00
17         pressure_var23  1.249327e+00
18           total_time23  1.286743e+15


VIF for Task24:
                 Features           VIF
0                   const  1.599396e+02
1              air_time24  2.370316e+14
2            disp_index24  7.650013e+00
3           gmrt_in_air24           inf
4         gmrt_on_paper24           inf
5       max_x_extension24  1.127002e+01
6       max_y_extension24  9.385270e+00
7       mean_acc_in_air24  1.012693e+03
8     mean_acc_on_paper24  8.138784e+00
9             mean_gmrt24           inf
10     mean_jerk_in_air24  9.887079e+02
11   mean_jerk_on_paper24  1.079159e+01
12    mean_speed_in_air24  5.749085e+01
13  mean_speed_on_paper24  1.176921e+01
14       num_of_pendown24  2.766528e+00
15           paper_time24  5.629500e+14
16        pressure_mean24  1.595625e+00
17         pressure_var24  1.476198e+00
18           total_time24  5.088813e+13


VIF for Task25:
                 Features           VIF
0                   const  3.906111e+02
1              air_time25  3.002400e+15
2            disp_index25  8.061904e+00
3           gmrt_in_air25           inf
4         gmrt_on_paper25           inf
5       max_x_extension25  3.072491e+00
6       max_y_extension25  2.214602e+00
7       mean_acc_in_air25  1.586619e+03
8     mean_acc_on_paper25  7.287828e+00
9             mean_gmrt25           inf
10     mean_jerk_in_air25  1.578315e+03
11   mean_jerk_on_paper25  6.826671e+00
12    mean_speed_in_air25  7.074528e+01
13  mean_speed_on_paper25  2.947896e+01
14       num_of_pendown25  3.741947e+00
15           paper_time25  1.085205e+13
16        pressure_mean25  1.451165e+00
17         pressure_var25  1.856188e+00
18           total_time25  2.251800e+14



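The VIF results show severe multicollinearity in every task. gmrt_in_air, gmrt_on_paper, and mean_gmrt always have infinite VIF, which indicates perfect multicollinearity: mean_gmrt is the average of the other two (for the first participant in task 1, (120.804174 + 86.853334) / 2 = 103.828754). The time features air_time, paper_time, and total_time have VIFs on the order of 1e12 to 1e15, and infinite in a few tasks, which is consistent with total_time being the sum of the other two. The in-air acceleration and jerk features are also strongly collinear, with VIFs ranging from the tens to the thousands.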
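To illustrate why an exact linear relationship yields an infinite VIF, the sketch below recomputes VIF by hand on synthetic data. It is an illustration under assumptions, not the notebook's code: the helper `vif`, the `eps` cutoff, and the random toy columns are hypothetical, and plain NumPy least squares stands in for statsmodels.

```python
import numpy as np

def vif(X, j, eps=1e-12):
    """VIF of column j: 1 / (1 - R^2), regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    unexplained = resid.var() / y.var()  # equals 1 - R^2
    # An exact linear dependence leaves (numerically) nothing unexplained
    return np.inf if unexplained < eps else 1.0 / unexplained

# Synthetic stand-ins for one task's columns (made-up values)
rng = np.random.default_rng(0)
n = 165
gmrt_in_air = rng.normal(200, 50, n)
gmrt_on_paper = rng.normal(150, 40, n)
mean_gmrt = (gmrt_in_air + gmrt_on_paper) / 2  # exact average of the two
pressure_mean = rng.normal(1600, 200, n)

full = np.column_stack([gmrt_in_air, gmrt_on_paper, mean_gmrt, pressure_mean])
reduced = np.column_stack([mean_gmrt, pressure_mean])  # redundant columns dropped

print("full:   ", [vif(full, j) for j in range(full.shape[1])])
print("reduced:", [vif(reduced, j) for j in range(reduced.shape[1])])
```

In this toy setup the three GMRT columns are infinite while all three are present, and the VIF of mean_gmrt falls close to 1 once the two component columns are dropped.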

The goal is to reduce multicollinearity while preserving as much useful information as possible. We'll focus on removing redundant features and potentially creating new features to capture the essence of the highly correlated variables.

Time-Related Features:

Keep: total_time for each task (captures overall task duration)
Remove: air_time and paper_time (highly correlated with total_time)
Feature Engineering:
Air-to-Paper Ratio: air_time / paper_time (measures relative time in air vs. on paper)

GMRT Features:

Primary GMRT Feature: Keep mean_gmrt for each task. GMRT is the generalization of the mean relative tremor, and mean_gmrt averages its in-air and on-paper values.
Redundant GMRT Features: Remove gmrt_in_air and gmrt_on_paper, since mean_gmrt is their average and the three are perfectly collinear.

Speed Features:

Keep: mean_speed_on_paper for each task (directly measures writing speed)
Remove: mean_speed_in_air (less directly related to handwriting quality)

Movement Smoothness Features:

Keep: mean_jerk_in_air and mean_jerk_on_paper (captures smoothness of movement)
Remove: mean_acc_in_air and mean_acc_on_paper (highly correlated with jerk features)

Pressure Features:

Remove: pressure_mean and pressure_var (overall pressure level and its variance, both summarized by the ratio below)
Feature Engineering:
Pressure Variation Index: pressure_var / pressure_mean (measures pressure fluctuation)

Spatial Features:

Keep: disp_index (measures how much of the paper is used)
Remove: max_x_extension and max_y_extension (measure writing extent)
Feature Engineering:
Writing Area: max_x_extension * max_y_extension (captures the overall area covered)

Pendowns Number:

Keep: num_of_pendown (measures how many times the pen touches the paper)
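
The plan above could be applied per task roughly as follows. This is a minimal sketch, not the notebook's implementation: the function `reduce_task_features` and the engineered column names (`air_paper_ratio`, `pressure_var_index`, `writing_area`) are hypothetical, and the toy values are made up. It assumes the column naming shown in the VIF tables (feature name followed by the task number).

```python
import pandas as pd

def reduce_task_features(df, task):
    """Build the reduced feature set for one task, following the plan above."""
    t = str(task)
    out = pd.DataFrame(index=df.index)

    # Features kept unchanged
    for name in ["total_time", "mean_gmrt", "mean_speed_on_paper",
                 "mean_jerk_in_air", "mean_jerk_on_paper",
                 "disp_index", "num_of_pendown"]:
        out[name + t] = df[name + t]

    # Engineered replacements for the removed features (hypothetical names).
    # A zero paper_time or pressure_mean would give inf or NaN here.
    out["air_paper_ratio" + t] = df["air_time" + t] / df["paper_time" + t]
    out["pressure_var_index" + t] = df["pressure_var" + t] / df["pressure_mean" + t]
    out["writing_area" + t] = df["max_x_extension" + t] * df["max_y_extension" + t]
    return out

# Made-up values for two participants, task 1 only
toy = pd.DataFrame({
    "air_time1": [5000, 2000], "paper_time1": [10000, 8000],
    "total_time1": [15000, 10000],
    "gmrt_in_air1": [120.0, 260.0], "gmrt_on_paper1": [80.0, 110.0],
    "mean_gmrt1": [100.0, 185.0],
    "mean_speed_in_air1": [5.6, 3.3], "mean_speed_on_paper1": [3.2, 1.7],
    "mean_acc_in_air1": [0.36, 0.27], "mean_acc_on_paper1": [0.22, 0.15],
    "mean_jerk_in_air1": [0.05, 0.04], "mean_jerk_on_paper1": [0.02, 0.02],
    "pressure_mean1": [1500.0, 2000.0], "pressure_var1": [300000.0, 200000.0],
    "disp_index1": [0.000013, 0.000007],
    "max_x_extension1": [1000, 2000], "max_y_extension1": [6000, 7000],
    "num_of_pendown1": [71, 92],
})

reduced = reduce_task_features(toy, 1)
print(reduced.T)
```

Under the same naming assumption, all 25 tasks could be combined with `pd.concat([reduce_task_features(df, t) for t in range(1, 26)], axis=1)`, which would give 10 features per task instead of 18.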
In [23]:
# Drop every row in which any column equals zero.

# Zeros stored as the string "0.000000" are first converted to numeric 0
df = df.replace("0.000000", 0)

# Keep only the rows where no column equals 0
df = df[(df != 0).all(axis=1)]

Missing values

In [24]:
# Count missing cells across the whole DataFrame (not the number of columns with gaps)
print(f"There are {df.isnull().sum().sum()} missing values.")
There are 0 missing values.

Compute the Mahalanobis distance and detect outliers

In [25]:
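def mahalanobis_distance(data):
    """Mahalanobis distance of each row of `data` from the column means."""
    data_array = data.to_numpy()

    mean_vector = np.mean(data_array, axis=0)

    cov_matrix = np.cov(data_array, rowvar=False)

    # np.linalg.inv assumes a non-singular covariance matrix. With more
    # feature columns than rows (450 vs. 165 here) the sample covariance
    # is singular, so this inverse is numerically unreliable.
    cov_inv = np.linalg.inv(cov_matrix)

    diff = data_array - mean_vector
    mahalanobis_dist = np.sqrt(np.sum(np.dot(diff, cov_inv) * diff, axis=1))

    return mahalanobis_dist

def detect_outliers(data, threshold=0.95):
    """Return positional indices of rows flagged as outliers."""
    mahalanobis_dist = mahalanobis_distance(data)

    # Chi-square quantile with one degree of freedom per feature column.
    # Note: this quantile applies to the squared distance, while the
    # comparison below uses the unsquared distance. NaN distances also
    # compare as False and are therefore never flagged.
    chi2_threshold = chi2.ppf(threshold, df=data.shape[1])

    outlier_indices = np.where(mahalanobis_dist > chi2_threshold)[0]

    return outlier_indices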

outlier_indices = detect_outliers(df.iloc[:,:-1])

# Print indices of outlier data points
print("Indices of outlier data points:", outlier_indices)

print("Outlier percentages (%): ", len(outlier_indices)/len(df)*100)
Indices of outlier data points: []
Outlier percentages (%):  0.0
/var/folders/kt/rg6_d5c90v16qpfm3hltsfb80000gn/T/ipykernel_6608/3364540850.py:12: RuntimeWarning: invalid value encountered in sqrt
  mahalanobis_dist = np.sqrt(np.sum(np.dot(diff, cov_inv) * diff, axis=1))
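
The RuntimeWarning above suggests that some squared distances came out negative or NaN, which a valid inverse covariance cannot produce. With 450 feature columns and 165 rows the sample covariance is singular, so the empty outlier list may reflect NaN distances rather than an absence of outliers. One possible alternative, sketched below on synthetic data, measures the distance only along the directions in which the data actually vary and compares the squared distance with the chi-square quantile. The function name, the `rel_tol` cutoff, and the toy data are assumptions for illustration; the constant 5.991 is the 95th percentile of a chi-square distribution with 2 degrees of freedom, which `chi2.ppf(0.95, df=rank)` would return for this toy case.

```python
import numpy as np

def squared_mahalanobis(X, rel_tol=1e-10):
    """Squared Mahalanobis distances, restricted to non-degenerate directions.

    Returns the distances and the number of directions used (the rank).
    """
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)

    # Eigen-decomposition of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    keep = eigvals > rel_tol * eigvals.max()  # drop (near-)zero variance directions

    # Project onto the kept directions and scale by their variances
    proj = diff @ eigvecs[:, keep]
    d2 = np.sum(proj ** 2 / eigvals[keep], axis=1)
    return d2, int(keep.sum())

# Toy data: two informative columns plus a redundant third (their sum),
# which makes the covariance matrix singular
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
base[0] = [8.0, -8.0]  # planted outlier
X = np.column_stack([base, base.sum(axis=1)])

d2, rank = squared_mahalanobis(X)

CHI2_95_DF2 = 5.991  # 95th percentile of chi-square with 2 degrees of freedom
outliers = np.where(d2 > CHI2_95_DF2)[0]
print("rank:", rank)
print("flagged rows:", outliers)
```

The distances are non-negative by construction, so no square root of a negative number can occur. Whether a chi-square cutoff is appropriate for the real features depends on how close they are to multivariate normal, which this sketch does not check.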
In [27]:
# Encode the target: healthy (H) -> 0, patient (P) -> 1
df['class'] = df['class'].map({'H': 0, 'P': 1})

print(df['class'].value_counts())
class
1    83
0    82
Name: count, dtype: int64
In [28]:
clean_df = df
clean_df
Out[28]:
air_time1 disp_index1 gmrt_in_air1 gmrt_on_paper1 max_x_extension1 max_y_extension1 mean_acc_in_air1 mean_acc_on_paper1 mean_gmrt1 mean_jerk_in_air1 ... mean_jerk_in_air25 mean_jerk_on_paper25 mean_speed_in_air25 mean_speed_on_paper25 num_of_pendown25 paper_time25 pressure_mean25 pressure_var25 total_time25 class
0 5160 0.000013 120.804174 86.853334 957 6601 0.361800 0.217459 103.828754 0.051836 ... 0.141434 0.024471 5.596487 3.184589 71 40120 1749.278166 296102.7676 144605 1
1 51980 0.000016 115.318238 83.448681 1694 6998 0.272513 0.144880 99.383459 0.039827 ... 0.049663 0.018368 1.665973 0.950249 129 126700 1504.768272 278744.2850 298640 1
3 2130 0.000010 369.403342 183.193104 1756 8159 0.556879 0.164502 276.298223 0.090408 ... 0.113905 0.019860 4.206746 1.613522 123 67945 1465.843329 230184.7154 181220 1
4 2310 0.000007 257.997131 111.275889 987 4732 0.266077 0.145104 184.636510 0.037528 ... 0.121782 0.020872 3.319036 1.680629 92 37285 1841.702561 158290.0255 72575 1
5 1920 0.000011 199.764957 109.902254 1548 6260 0.212523 0.143013 154.833606 0.028369 ... 0.131135 0.018907 3.643543 1.667827 76 43790 1081.054579 152045.4446 74605 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
169 2930 0.000010 241.736477 176.115957 1839 6439 0.253347 0.174663 208.926217 0.032691 ... 0.119152 0.020909 4.508709 2.233198 96 44545 1798.923336 247448.3108 80335 0
170 2140 0.000009 274.728964 234.495802 2053 8487 0.225537 0.174920 254.612383 0.032059 ... 0.174495 0.017640 4.685573 2.806888 84 37560 1725.619941 160664.6464 345835 0
171 3830 0.000008 151.536989 171.104693 1287 7352 0.165480 0.161058 161.320841 0.022705 ... 0.114472 0.017194 3.493815 2.510601 88 51675 1915.573488 128727.1241 83445 0
172 1760 0.000008 289.518195 196.411138 1674 6946 0.518937 0.202613 242.964666 0.090686 ... 0.114472 0.017194 3.493815 2.510601 88 51675 1915.573488 128727.1241 83445 0
173 2875 0.000008 235.769350 178.208024 1838 6560 0.567311 0.147818 206.988687 0.099555 ... 0.114472 0.017194 3.493815 2.510601 88 51675 1915.573488 128727.1241 83445 0

165 rows × 451 columns

In [32]:
# Save the DataFrame 'clean_df' to a CSV file named 'clean_df.csv'
clean_df.to_csv('clean_df.csv', index=False)